Abstract: We present a new active learning algorithm based on nonparametric
estimators of the regression function. Our investigation provides probabilistic
bounds for the rates of convergence of the generalization error achievable by the
proposed method over a broad class of underlying distributions. We also prove
minimax lower bounds which show that the obtained rates are almost tight.

1. Introduction

Let $(S,\mathcal{B})$ be a measurable space and let $(X,Y)\in S\times\{-1,1\}$ be a random
couple with unknown distribution $P$. The marginal distribution of the design variable
$X$ will be denoted by $\Pi$. Let $\eta(x):=\mathbb{E}(Y|X=x)$ be the regression function.
The goal of binary classification is to predict the label $Y$ based on the observation $X$.
Prediction is based on a classifier, that is, a measurable function $f:S\mapsto\{-1,1\}$.
The quality of a classifier is measured in terms of its generalization error,
$R(f)=\Pr(Y\ne f(X))$. In practice, the distribution $P$ remains unknown but the learning
algorithm has access to the training data, the i.i.d. sample $(X_i,Y_i)$, $i=1\ldots n$
from $P$. It often happens that the cost of obtaining the training data is associated with
labeling the observations $X_i$ while the pool of observations itself is almost unlimited.
This suggests measuring the performance of a learning algorithm in terms of its label
complexity, the number of labels $Y_i$ required to obtain a classifier with the desired
accuracy. Active learning theory is mainly devoted to the design and analysis of algorithms
that can take advantage of this modified framework. Most of these procedures can be
characterized by the following property: at each step $k$, observation $X_k$ is sampled
from a distribution $\hat\Pi_k$ that depends on previously obtained $(X_i,Y_i)$, $i\le k-1$
(while passive learners obtain all available training data at the same time). $\hat\Pi_k$
is designed to be supported on a set where classification is difficult and requires more
labeled data to be collected. The situation when active learners outperform passive
algorithms might occur when the so-called Tsybakov's low noise assumption is satisfied:
there exist constants $B,\gamma>0$ such that

$\forall\ t>0,\quad \Pi(x:\ |\eta(x)|\le t)\le Bt^{\gamma}$    (1.1)

This assumption provides a convenient way to characterize the noise level
of the problem and will play a crucial role in our investigation.
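
To make condition (1.1) concrete, here is a minimal numerical sketch. It assumes a uniform marginal on $[0,1]$ and the hypothetical regression function $\eta(x)=2x-1$; neither choice comes from the paper, and the function names are illustrative. For this example the mass of $\{x:|\eta(x)|\le t\}$ equals $t$ for $t\le 1$, so (1.1) holds with $B=1$ and $\gamma=1$.

```python
def low_noise_mass(eta, t, n_grid=100_000):
    """Approximate Pi(x : |eta(x)| <= t) for the uniform marginal on [0, 1]
    by counting midpoints of a regular grid (illustrative assumption)."""
    hits = 0
    for i in range(n_grid):
        x = (i + 0.5) / n_grid  # midpoint of the i-th grid cell
        if abs(eta(x)) <= t:
            hits += 1
    return hits / n_grid


def eta_linear(x):
    # Hypothetical regression function crossing zero at x = 1/2.
    return 2.0 * x - 1.0
```

Calling `low_noise_mass(eta_linear, t)` for a few values of `t` and comparing with `B * t ** gamma` gives a quick sanity check of a candidate pair $(B,\gamma)$.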
The topic of active learning is widely present in the literature; see Balcan
et al. [3], Hanneke [7], Castro and Nowak [4] for a review. It was discovered
that in some cases the generalization error of a resulting classifier can converge
to zero exponentially fast with respect to its label complexity (while the
best rate for passive learning is usually polynomial with respect to the cardinality
of the training data set). However, available algorithms that adapt to
the unknown parameters of the problem ($\gamma$ in Tsybakov's low noise assumption,
regularity of the decision boundary) involve empirical risk minimization
with binary loss, along with other computationally hard problems, see Balcan
et al. [2], Hanneke [7]. On the other hand, the algorithms that can be
effectively implemented, as in Castro and Nowak [4], are not adaptive.
The majority of the previous work in the field was done under standard
complexity assumptions on the set of possible classifiers (such as polynomial
growth of the covering numbers). Castro and Nowak [4] derived their
results under regularity conditions on the decision boundary and a
noise assumption which is slightly more restrictive than (1.1). Essentially,
they proved that if the decision boundary is a graph of a Hölder smooth
function $g\in\Sigma(\beta,K,[0,1]^{d-1})$ (see section 2 for definitions) and the noise
assumption is satisfied with $\gamma>0$, then the minimax lower bound for the
expected excess risk of the active classifier is of order
$C\cdot N^{-\frac{\beta(1+\gamma)}{2\beta+\gamma(d-1)}}$ and the upper bound is
$C(N/\log N)^{-\frac{\beta(1+\gamma)}{2\beta+\gamma(d-1)}}$, where $N$ is the label budget.
However, the construction of the classifier that achieves the upper bound assumes
$\beta$ and $\gamma$ to be known.

In this paper, we consider the problem of active learning under classical
nonparametric assumptions on the regression function: namely, we assume
that it belongs to a certain Hölder class $\Sigma(\beta,K,[0,1]^d)$ and satisfies the
low noise condition (1.1) with some positive $\gamma$. In this case, the work of
Audibert and Tsybakov [1] showed that plug-in classifiers can attain optimal
rates in the passive learning framework, namely, that the expected excess
risk of a classifier $\hat g=\mathrm{sign}\,\hat\eta$ is bounded above by
$CN^{-\frac{\beta(1+\gamma)}{2\beta+d}}$ (which is the optimal rate), where $\hat\eta$ is the
local polynomial estimator of the regression function and $N$ is the size of the
training data set. We were able to partially extend this claim to the case of
active learning: first, we obtain minimax lower bounds for the excess risk of an
active classifier in terms of its label complexity. Second, we propose a new
algorithm that is based on plug-in classifiers, attains almost optimal rates over
a broad class of distributions and possesses adaptivity with respect to
$\beta,\gamma$ (within a certain range of these parameters).

The paper is organized as follows: the next section introduces remaining 
notations and specifies the main assumptions made throughout the paper. 
This is followed by a qualitative description of our learning algorithm. The 
second part of the work contains the statements and proofs of our main 
results - minimax upper and lower bounds for the excess risk. 

2. Preliminaries 

Our active learning framework is governed by the following rules: 

1. Observations are sampled sequentially: $X_k$ is sampled from the modified
distribution $\hat\Pi_k$ that depends on $(X_1,Y_1),\ldots,(X_{k-1},Y_{k-1})$.

2. $Y_k$ is sampled from the conditional distribution $P_{Y|X}(\cdot|X=x)$. Labels
are conditionally independent given the feature vectors $X_i$, $i\le n$.

Usually, the distribution $\hat\Pi_k$ is supported on a set where classification is
difficult.

Given the probability measure $Q$ on $S\times\{-1,1\}$, we denote the integral
with respect to this measure by $Qg:=\int g\,dQ$. Let $\mathcal{F}$ be a class of bounded,
measurable functions. The risk and the excess risk of $f\in\mathcal{F}$ with respect to
the measure $Q$ are defined by

$R_Q(f):=Q\,\mathcal{I}_{y\ne\mathrm{sign}\,f(x)}$,

$\mathcal{E}_Q(f):=R_Q(f)-\inf_{g\in\mathcal{F}}R_Q(g)$,

where $\mathcal{I}_A$ is the indicator of event $A$. We will omit the subindex $Q$ when the
underlying measure is clear from the context. Recall that we denoted the
distribution of $(X,Y)$ by $P$. The minimal possible risk with respect to $P$ is

$R^*=\inf_{g:S\mapsto[-1,1]}\Pr\left(Y\ne\mathrm{sign}\,g(X)\right)$,

where the infimum is taken over all measurable functions. It is well known
that it is attained for any $g$ such that $\mathrm{sign}\,g(x)=\mathrm{sign}\,\eta(x)$ $\Pi$-a.s. Given
$g\in\mathcal{F}$, $A\in\mathcal{B}$, $\delta>0$, define

$\mathcal{F}_{\infty,A}(g;\delta):=\left\{f\in\mathcal{F}:\ \|f-g\|_{\infty,A}\le\delta\right\}$,

where $\|f-g\|_{\infty,A}=\sup_{x\in A}|f(x)-g(x)|$. For $A\in\mathcal{B}$, define the function class

$\mathcal{F}|_A:=\{f|_A,\ f\in\mathcal{F}\}$,

where $f|_A(x):=f(x)\mathcal{I}_A(x)$. From now on, we restrict our attention to the
case $S=[0,1]^d$. Let $K>0$.

Definition 2.1. We say that $g:\mathbb{R}^d\mapsto\mathbb{R}$ belongs to $\Sigma(\beta,K,[0,1]^d)$, the
$(\beta,K,[0,1]^d)$-Hölder class of functions, if $g$ is $\lfloor\beta\rfloor$ times continuously
differentiable and for all $x,x_1\in[0,1]^d$ satisfies

$|g(x_1)-T_x(x_1)|\le K\|x-x_1\|_\infty^{\beta}$,

where $T_x$ is the Taylor polynomial of degree $\lfloor\beta\rfloor$ of $g$ at the point $x$.

Definition 2.2. $\mathcal{P}(\beta,\gamma)$ is the class of probability distributions on
$[0,1]^d\times\{-1,+1\}$ with the following properties:

1. $\forall\ t>0,\ \Pi(x:\ |\eta(x)|\le t)\le Bt^{\gamma}$;

2. $\eta(x)\in\Sigma(\beta,K,[0,1]^d)$.

We do not mention the dependence of $\mathcal{P}(\beta,\gamma)$ on the fixed constants $B,K$
explicitly, but this should not cause any uncertainty.

Finally, let us define $\mathcal{P}_U(\beta,\gamma)$ and $\mathcal{P}_U^*(\beta,\gamma)$, the subclasses of
$\mathcal{P}(\beta,\gamma)$, by imposing two additional assumptions. Along with the formal
descriptions of these assumptions, we shall try to provide some motivation behind them.
The first deals with the marginal $\Pi$. For an integer $M\ge1$, let

$\mathcal{G}_M:=\left\{\left(\frac{k_1}{M},\ldots,\frac{k_d}{M}\right),\ k_i=1\ldots M,\ i=1\ldots d\right\}$

be the regular grid on the unit cube $[0,1]^d$ with mesh size $M^{-1}$. It naturally
defines a partition into a set of $M^d$ open cubes $R_i$, $i=1\ldots M^d$ with edges
of length $M^{-1}$ and vertices in $\mathcal{G}_M$. Below, we consider the nested sequence
of grids $\{\mathcal{G}_{2^m},\ m\ge1\}$ and corresponding dyadic partitions of the unit cube.

Definition 2.3. We will say that $\Pi$ is $(u_1,u_2)$-regular with respect to $\{\mathcal{G}_{2^m}\}$
if for any $m\ge1$ and any element of the partition $R_i$, $i\le2^{dm}$ such that
$R_i\cap\mathrm{supp}(\Pi)\ne\emptyset$, we have

$u_1\cdot2^{-dm}\le\Pi(R_i)\le u_2\cdot2^{-dm}$,    (2.1)

where $0<u_1\le u_2<\infty$.
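
The regularity condition (2.1) can be checked empirically on a sample. The sketch below is only an illustration under stated assumptions: the helper names are hypothetical, the marginal is replaced by the empirical measure of the given points, and only cubes containing at least one point are inspected.

```python
from collections import Counter


def dyadic_cell(x, m):
    """Index (k_1, ..., k_d) of the dyadic cube of side 2**-m containing x in [0, 1]^d.
    Points on the upper boundary are assigned to the last cube."""
    side = 2 ** m
    return tuple(min(int(c * side), side - 1) for c in x)


def regularity_constants(points, m):
    """Empirical analogue of (2.1): returns (min, max) of Pi_n(R_i) * 2**(d*m)
    over the cubes R_i that contain at least one sample point."""
    d = len(points[0])
    counts = Counter(dyadic_cell(x, m) for x in points)
    scale = 2 ** (d * m) / len(points)
    masses = [c * scale for c in counts.values()]
    return min(masses), max(masses)
```

For points spread evenly over $[0,1]^d$ both returned values are close to 1, which corresponds to $u_1=u_2=1$ for the uniform marginal.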

Assumption 1. $\Pi$ is $(u_1,u_2)$-regular.



S. Minsker/ 'Plug-in Approach 






In particular, $(u_1,u_2)$-regularity holds for the distribution with a density
$p$ on $[0,1]^d$ such that $0<u_1\le p(x)\le u_2<\infty$.

Let us mention that our definition of regularity is of rather technical nature;
for most of the paper, the reader might think of $\Pi$ as being uniform on
$[0,1]^d$ (however, we need a slightly more complicated marginal to construct
the minimax lower bounds for the excess risk). It is known that estimation
of the regression function in sup-norm is sensitive to the geometry of the design
distribution, mainly because the quality of estimation depends on the local
amount of data at every point; conditions similar to our assumption 1 were
used in the previous works where this problem appeared, e.g., the strong density
assumption in Audibert and Tsybakov [1] and assumption D in Gaiffas [5].
Another useful characteristic of a $(u_1,u_2)$-regular distribution $\Pi$ is that this
property is stable with respect to restrictions of $\Pi$ to certain subsets of its
support. This fact fits the active learning framework particularly well.

Definition 2.4. We say that $Q$ belongs to $\mathcal{P}_U(\beta,\gamma)$ if $Q\in\mathcal{P}(\beta,\gamma)$ and
assumption 1 is satisfied for some $u_1,u_2$.

The second assumption is crucial in the derivation of the upper bounds. The
space of piecewise-constant functions which is used to construct the estimators
of $\eta(x)$ is defined via

$\mathcal{F}_m:=\left\{\sum_{i=1}^{2^{dm}}\lambda_i\mathcal{I}_{R_i}(\cdot):\ |\lambda_i|\le1,\ i=1\ldots2^{dm}\right\}$,

where $\{R_i\}_{i=1}^{2^{dm}}$ forms the dyadic partition of the unit cube. Note that $\mathcal{F}_m$
can be viewed as a $\|\cdot\|_\infty$-unit ball in the linear span of the first $2^{dm}$ Haar basis
functions in $[0,1]^d$. Moreover, $\{\mathcal{F}_m,\ m\ge1\}$ is a nested family, which is a
desirable property for the model selection procedures. By $\bar\eta_m(x)$ we denote
the $L_2(\Pi)$-projection of the regression function onto $\mathcal{F}_m$.
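
For the uniform marginal, the projection onto piecewise-constant functions reduces to taking cell averages of $\eta$, and the constraint $|\lambda_i|\le1$ is automatic because $|\eta|\le1$. A minimal one-dimensional sketch follows; the uniform marginal, the midpoint quadrature and the function names are assumptions made for illustration only.

```python
def project_piecewise_constant(eta, m, n_quad=1000):
    """L2 projection of eta onto functions constant on the 2**m dyadic cells of [0, 1],
    for the uniform marginal: each coefficient is the cell average of eta,
    approximated here by a midpoint rule with n_quad nodes per cell."""
    cells = 2 ** m
    coeffs = []
    for i in range(cells):
        left = i / cells
        total = 0.0
        for j in range(n_quad):
            total += eta(left + (j + 0.5) / (n_quad * cells))
        coeffs.append(total / n_quad)
    return coeffs


def evaluate(coeffs, x):
    """Value at x in [0, 1] of the piecewise-constant function with the given cell values."""
    cells = len(coeffs)
    return coeffs[min(int(x * cells), cells - 1)]
```

For the hypothetical choice $\eta(x)=2x-1$ and $m=2$ the coefficients are the values of $\eta$ at the four cell midpoints.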
We will say that the set $A\subset[0,1]^d$ approximates the decision boundary
$\{x:\eta(x)=0\}$ if there exists $t>0$ such that

$\{x:\ |\eta(x)|\le t\}_\Pi\subseteq A_\Pi\subseteq\{x:\ |\eta(x)|\le3t\}_\Pi$,    (2.2)

where for any set $A$ we define $A_\Pi:=A\cap\mathrm{supp}(\Pi)$. The most important
example we have in mind is the following: let $\hat\eta$ be some estimator of $\eta$ with
$\|\hat\eta-\eta\|_{\infty,\mathrm{supp}(\Pi)}\le t$, and define the $2t$-band around $\hat\eta$ by

$\hat{\mathcal{F}}=\left\{f:\ \hat\eta(x)-2t\le f(x)\le\hat\eta(x)+2t\ \ \forall x\in[0,1]^d\right\}$.

Take $A=\left\{x:\ \exists f_1,f_2\in\hat{\mathcal{F}}\ \text{s.t.}\ \mathrm{sign}\,f_1(x)\ne\mathrm{sign}\,f_2(x)\right\}$, then it is easy to
see that $A$ satisfies (2.2). Modified design distributions used by our algorithm
are supported on the sets with similar structure.
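
The reasoning behind this example is short: the band contains functions of both signs at $x$ roughly when $|\hat\eta(x)|\le2t$; if $|\eta(x)|\le t$ then $|\hat\eta(x)|\le2t$, and if $|\hat\eta(x)|\le2t$ then $|\eta(x)|\le3t$. For a piecewise-constant estimator this amounts to a one-line filter over cells, sketched below. The function name is hypothetical, and the closed inequality is an assumed convention for the sign at zero.

```python
def active_cells(eta_hat, half_width):
    """Indices of cells where the band [eta_hat - half_width, eta_hat + half_width]
    reaches zero, so that it contains functions of both signs.
    The closed inequality is a convention for sign(0), assumed here."""
    return [i for i, v in enumerate(eta_hat) if abs(v) <= half_width]
```

With cell values `[-0.75, -0.25, 0.25, 0.75]` and $t=0.15$ (half-width $2t=0.3$) only the two middle cells remain active.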

Let $\sigma(\mathcal{F}_m)$ be the sigma-algebra generated by $\mathcal{F}_m$ and $A\in\sigma(\mathcal{F}_m)$.

Assumption 2. There exists $B_2>0$ such that for all $m\ge1$, $A\in\sigma(\mathcal{F}_m)$
satisfying (2.2) and such that $A_\Pi\ne\emptyset$ the following holds true:

$\int_{[0,1]^d}(\eta-\bar\eta_m)^2\,\Pi(dx|x\in A_\Pi)\ \ge\ B_2\,\|\eta-\bar\eta_m\|^2_{\infty,A_\Pi}$.



Appearance of assumption 2 is motivated by the structure of our learning
algorithm: namely, it is based on adaptive confidence bands for the regression
function. Nonparametric confidence bands are a big topic in the statistical
literature, and a review of this subject is not our goal. We just mention
that it is impossible to construct adaptive confidence bands of optimal size
over the whole $\bigcup_{\beta\le1}\Sigma(\beta,K,[0,1]^d)$. Hoffmann and Nickl [8] and Low [11]
discuss the subject in detail. However, it is possible to construct adaptive
$L_2$-confidence balls (see an example following Theorem 6.1 in Koltchinskii [10]).
For functions satisfying assumption 2, this fact allows one to obtain confidence
bands of desired size. In particular,

(a) functions that are differentiable, with gradient being bounded away from
0 in the vicinity of the decision boundary;

(b) Lipschitz continuous functions that are convex in the vicinity of the decision
boundary

satisfy assumption 2. For precise statements, see Propositions A.1, A.2 in
Appendix A. A different approach to adaptive confidence bands in the case
of one-dimensional density estimation is presented in Giné and Nickl [6].
Finally, we define $\mathcal{P}_U^*(\beta,\gamma)$:

Definition 2.5. We say that $Q$ belongs to $\mathcal{P}_U^*(\beta,\gamma)$ if $Q\in\mathcal{P}_U(\beta,\gamma)$ and
assumption 2 is satisfied for some $B_2>0$.

2.1. Learning algorithm 

Now we give a brief description of the algorithm, since several definitions
appear naturally in this context. First, let us emphasize that the marginal
distribution $\Pi$ is assumed to be known to the learner. This is not a restriction,
since we are not limited in the use of unlabeled data and $\Pi$ can be
estimated to any desired accuracy. Our construction is based on so-called
plug-in classifiers of the form $\hat f(\cdot)=\mathrm{sign}\,\hat\eta(\cdot)$, where $\hat\eta$ is a piecewise-constant
estimator of the regression function. As we have already mentioned above,
it was shown in Audibert and Tsybakov [1] that in the passive learning
framework plug-in classifiers attain the optimal rate for the excess risk of order
$N^{-\frac{\beta(1+\gamma)}{2\beta+d}}$, with $\hat\eta$ being the local polynomial estimator.

Our active learning algorithm iteratively improves the classifier by constructing
shrinking confidence bands for the regression function. On every
step $k$, the piecewise-constant estimator $\hat\eta_k$ is obtained via the model selection
procedure which allows adaptation to the unknown smoothness (for
Hölder exponent $\le1$). The estimator is further used to construct a confidence
band $\hat{\mathcal{F}}_k$ for $\eta(x)$. The active set associated with $\hat{\mathcal{F}}_k$ is defined as

$\hat A_k=A(\hat{\mathcal{F}}_k):=\left\{x\in\mathrm{supp}(\Pi):\ \exists f_1,f_2\in\hat{\mathcal{F}}_k,\ \mathrm{sign}\,f_1(x)\ne\mathrm{sign}\,f_2(x)\right\}$.

Clearly, this is the set where the confidence band crosses zero level and where
classification is potentially difficult. $\hat A_k$ serves as a support of the modified
distribution $\hat\Pi_{k+1}$: on step $k+1$, label $Y$ is requested only for observations
$X\in\hat A_k$, forcing the labeled data to concentrate in the domain where higher
precision is needed. This allows one to obtain a tighter confidence band for
the regression function restricted to the active set. Since $\hat A_k$ approaches the
decision boundary, its size is controlled by the low noise assumption. The
algorithm does not require a priori knowledge of the noise and regularity
parameters, being adaptive for $\gamma>0$, $\beta\le1$. Further details are given in
section 3.2.

Fig 1. Active Learning Algorithm

2.2. Comparison inequalities

Before proceeding to the main results, let us recall the well-known connections
between the binary risk and the $\|\cdot\|_\infty$, $\|\cdot\|_{L_2(\Pi)}$-norm risks:

Proposition 2.1. Under the low noise assumption,

$R_P(f)-R^*\le D_1\left\|(f-\eta)\,\mathcal{I}\{\mathrm{sign}\,f\ne\mathrm{sign}\,\eta\}\right\|_\infty^{1+\gamma}$;    (2.3)

$R_P(f)-R^*\le D_2\left\|(f-\eta)\,\mathcal{I}\{\mathrm{sign}\,f\ne\mathrm{sign}\,\eta\}\right\|_{L_2(\Pi)}^{\frac{2(1+\gamma)}{2+\gamma}}$;    (2.4)

$R_P(f)-R^*\ge D_3\,\Pi(\mathrm{sign}\,f\ne\mathrm{sign}\,\eta)^{\frac{1+\gamma}{\gamma}}$.    (2.5)

Proof. For (2.3) and (2.4), see Audibert and Tsybakov [1], Lemmas 5.1 and 5.2
respectively, and for (2.5), see Koltchinskii [10], Lemma 5.2. □
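
A small numerical illustration of how the excess risk relates to the sup-norm error may help here. The sketch uses the standard identity $R_P(f)-R^*=\mathbb{E}\,|\eta(X)|\,\mathcal{I}\{\mathrm{sign}\,f(X)\ne\mathrm{sign}\,\eta(X)\}$ and assumes a uniform marginal on $[0,1]$; the function name and the example are hypothetical. For $\eta(x)=2x-1$ (so that $\gamma=1$) and $f=\eta-\varepsilon$, the excess risk equals $\varepsilon^2/4$, that is, it scales as the square of the sup-norm error $\varepsilon$, which is the behaviour a bound of type (2.3) predicts for $\gamma=1$.

```python
def excess_risk(f, eta, n_grid=200_000):
    """Excess risk of the classifier sign(f) under a uniform marginal on [0, 1],
    computed from E |eta(X)| 1{sign f(X) != sign eta(X)} with a midpoint rule.
    The value 0 is treated as having positive sign."""
    total = 0.0
    for i in range(n_grid):
        x = (i + 0.5) / n_grid
        if (f(x) >= 0) != (eta(x) >= 0):
            total += abs(eta(x))
    return total / n_grid
```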



3. Main results 



The question we address below is: what are the best possible rates that can 
be achieved by active algorithms in our framework and how these rates can 
be attained. 



3.1. Minimax lower bounds for the excess risk 

The goal of this section is to prove that for $P\in\mathcal{P}_U^*(\beta,\gamma)$ no active learner
can output a classifier with expected excess risk converging to zero faster
than $N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}$. Our result builds upon the minimax bounds of Audibert
and Tsybakov [1] and Castro and Nowak [4].

Remark. The theorem below is proved for a smaller class $\mathcal{P}_U^*(\beta,\gamma)$, which
implies the result for $\mathcal{P}(\beta,\gamma)$.

Theorem 3.1. Let $\beta,\gamma,d$ be such that $\beta\gamma\le d$. Then there exists $C>0$
such that for all $N$ large enough and for any active classifier $\hat f_n(x)$ we have

$\sup_{P\in\mathcal{P}_U^*(\beta,\gamma)}\mathbb{E}R_P(\hat f_n)-R^*\ \ge\ CN^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}$.

Proof. We proceed by constructing the appropriate family of classifiers
$f_\sigma(x)=\mathrm{sign}\,\eta_\sigma(x)$, in a way similar to Theorem 3.5 in Audibert and Tsybakov [1],
and then apply Theorem 2.5 from Tsybakov [13]. We present it below for the
reader's convenience.

Theorem 3.2. Let $\Sigma$ be a class of models, $d:\Sigma\times\Sigma\mapsto\mathbb{R}$ the pseudometric
and $\{P_f,\ f\in\Sigma\}$ a collection of probability measures associated with $\Sigma$.
Assume there exists a subset $\{f_0,\ldots,f_M\}$ of $\Sigma$ such that

1. $d(f_i,f_j)\ge2s>0\ \ \forall\ 0\le i<j\le M$;

2. $P_{f_j}\ll P_{f_0}$ for every $1\le j\le M$;

3. $\frac{1}{M}\sum_{j=1}^M\mathrm{KL}(P_{f_j},P_{f_0})\le\alpha\log M,\quad0<\alpha<\frac18$.

Then

$\inf_{\hat f}\sup_{f\in\Sigma}P_f\left(d(\hat f,f)\ge s\right)\ \ge\ \frac{\sqrt M}{1+\sqrt M}\left(1-2\alpha-\sqrt{\frac{2\alpha}{\log M}}\right)$,

where the infimum is taken over all possible estimators of $f$ based on a
sample from $P_f$ and $\mathrm{KL}(\cdot,\cdot)$ is the Kullback-Leibler divergence.
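
To get a feeling for the size of the right-hand side in Theorem 3.2, it can be evaluated directly. The helper below is a hypothetical illustration; it assumes the natural logarithm.

```python
import math


def minimax_probability_bound(M, alpha):
    """Right-hand side of the bound in Theorem 3.2:
    sqrt(M) / (1 + sqrt(M)) * (1 - 2*alpha - sqrt(2*alpha / log M)),
    with the natural logarithm assumed."""
    if M < 2 or not (0 < alpha < 1 / 8):
        raise ValueError("need M >= 2 and 0 < alpha < 1/8")
    root = math.sqrt(M)
    return root / (1 + root) * (1 - 2 * alpha - math.sqrt(2 * alpha / math.log(M)))
```

For a fixed $\alpha$ the value increases with $M$ and stays bounded away from zero, which is what makes the final step of the lower-bound argument work.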

Going back to the proof, let $q=2^l$, $l\ge1$ and

$G_q:=\left\{\left(\frac{2k_1-1}{2q},\ldots,\frac{2k_d-1}{2q}\right),\ k_i=1\ldots q,\ i=1\ldots d\right\}$

be the grid on $[0,1]^d$. For $x\in[0,1]^d$, let

$n_q(x)=\mathrm{argmin}\left\{\|x-x_k\|_2:\ x_k\in G_q\right\}$.

If $n_q(x)$ is not unique, we choose the one with smallest $\|\cdot\|_2$ norm. The unit
cube is partitioned with respect to $G_q$ as follows: $x_1,x_2$ belong to the same
subset if $n_q(x_1)=n_q(x_2)$. Let '$\succ$' be some order on the elements of $G_q$ such
that $x\succ y$ implies $\|x\|_2\ge\|y\|_2$. Assume that the elements of the partition
are enumerated with respect to the order of their centers induced by '$\succ$':
$[0,1]^d=\bigcup_{i=1}^{q^d}R_i$. Fix $1\le m\le q^d$ and let

$S:=\bigcup_{i=1}^mR_i$.

Note that the partition is ordered in such a way that there always exists
$1\le k\le q\sqrt d$ with

$B_+\left(0,\frac{k}{q}\right)\subseteq S\subseteq B_+\left(0,\frac{k+3\sqrt d}{q}\right)$,    (3.1)

where $B_+(0,R):=\left\{x\in\mathbb{R}^d_+:\ \|x\|_2\le R\right\}$. In other words, (3.1) means that
the difference between the radii of inscribed and circumscribed spherical
sectors of $S$ is of order $C(d)q^{-1}$.

Let $v>r_1>r_2$ be three integers satisfying

$2^{-v}<2^{-r_1}<2^{-r_1}\sqrt d<2^{-r_2}\sqrt d<2^{-1}$.    (3.2)

Define $u(x):\mathbb{R}\mapsto\mathbb{R}_+$ by

$u(x):=\frac{\int_x^\infty U(t)\,dt}{\int_{2^{-v}}^{1/2}U(t)\,dt}$,    (3.3)

where

$U(t):=\exp\left(-\frac{1}{(1/2-t)(t-2^{-v})}\right)$ for $t\in\left(2^{-v},\frac12\right)$, and $U(t):=0$ else.

Note that $u(x)$ is an infinitely differentiable function such that $u(x)=1$,
$x\in[0,2^{-v}]$ and $u(x)=0$, $x\ge\frac12$. Finally, for $x\in\mathbb{R}^d$ let

$\Phi(x):=Cu(\|x\|_2)$,

where $C:=C_{L,\beta}$ is chosen such that $\Phi\in\Sigma(\beta,L,\mathbb{R}^d)$.
Let $r_S:=\inf\{r>0:\ B_+(0,r)\supseteq S\}$ and

$A_0:=\bigcup\left\{R_i:\ R_i\cap B_+\left(0,r_S+q^{-\frac{\beta\gamma}{d}}\right)=\emptyset\right\}$.

Note that

$r_S\le c\,\frac{m^{1/d}}{q}$,    (3.4)

since $\mathrm{Vol}(S)=mq^{-d}$.

Define $\mathcal{H}_m=\{P_\sigma:\ \sigma\in\{-1,1\}^m\}$ to be the hypercube of probability
distributions on $[0,1]^d\times\{-1,+1\}$. The marginal distribution $\Pi$ of $X$ is
independent of $\sigma$: define its density $p$ by

$p(x)=\begin{cases}\frac{2^{d(r_1-1)}}{2^{d(r_1-r_2)}-1},& x\in B_\infty\left(z,\frac{2^{-r_2}}{q}\right)\setminus B_\infty\left(z,\frac{2^{-r_1}}{q}\right),\ z\in G_q\cap S,\\ c_0,& x\in A_0,\\ 0& \text{else},\end{cases}$

where $B_\infty(z,r):=\{x:\ \|x-z\|_\infty\le r\}$, $c_0:=\frac{1-mq^{-d}}{\mathrm{Vol}(A_0)}$ (note that
$\Pi(R_i)=q^{-d}$ $\forall i\le m$) and $r_1,r_2$ are defined in (3.2). In particular, $\Pi$ satisfies
assumption 1 since it is supported on the union of dyadic cubes and has
a density bounded above and below on $\mathrm{supp}(\Pi)$.

Fig 2. Geometry of the support
Let

$\Psi(x):=u\left(1/2-q^{\frac{\beta\gamma}{d}}\,\mathrm{dist}_2(x,B_+(0,r_S))\right)$,

where $u(\cdot)$ is defined in (3.3) and $\mathrm{dist}_2(x,A):=\inf\{\|x-y\|_2,\ y\in A\}$.
Finally, the regression function $\eta_\sigma(x)=\mathbb{E}_{P_\sigma}(Y|X=x)$ is defined via

$\eta_\sigma(x):=\begin{cases}\sigma_iq^{-\beta}\Phi(q[x-n_q(x)]),& x\in R_i,\ 1\le i\le m,\\ \frac{1}{C_{L,\beta}\sqrt d}\,\mathrm{dist}_2(x,B_+(0,r_S))^{\frac d\gamma}\cdot\Psi(x),& x\in[0,1]^d\setminus S.\end{cases}$

The graph of $\eta_\sigma$ is a surface consisting of small "bumps" spread around $S$
and tending away from 0 monotonically with respect to $\mathrm{dist}_2(\cdot,B_+(0,r_S))$ on
$[0,1]^d\setminus S$. Clearly, $\eta_\sigma(x)$ satisfies the smoothness requirement, since for $x\in[0,1]^d$

$\mathrm{dist}_2(x,B_+(0,r_S))=(\|x\|_2-r_S)\vee0$

and $\frac d\gamma\ge\beta$ by assumption.$^1$ Let's check that it also satisfies the low noise
condition. Since $|\eta_\sigma|\ge Cq^{-\beta}$ on the support of $\Pi$, it is enough to consider
$t=Czq^{-\beta}$ for $z>1$:

$\Pi(|\eta_\sigma(x)|\le Czq^{-\beta})\le mq^{-d}+\Pi\left(\mathrm{dist}_2(x,B_+(0,r_S))\le Cz^{\gamma/d}q^{-\frac{\beta\gamma}{d}}\right)\le$

$\le mq^{-d}+C_2\left(r_S+Cz^{\gamma/d}q^{-\frac{\beta\gamma}{d}}\right)^d\le$

$\le mq^{-d}+C_3mq^{-d}+C_4z^{\gamma}q^{-\beta\gamma}\le$

$\le\hat Ct^{\gamma}$,

if $mq^{-d}=O(q^{-\beta\gamma})$. Here, the first inequality follows from considering $\eta_\sigma$
on $S$ and $A_0$ separately, and the second inequality follows from (3.4) and direct
computation of the sphere volume.

$^1$ $\Psi(x)$ can be replaced by 1 unless $\beta\gamma=d$ and $\beta$ is an integer, in which case extra
smoothness at the boundary of $B_+(0,r_S)$, provided by $\Psi$, is necessary.

Finally, $\eta_\sigma$ satisfies assumption 2 with some $B_2:=B_2(q)$ since on $\mathrm{supp}(\Pi)$

$0<c_1(q)\le\|\nabla\eta_\sigma(x)\|_2\le c_2(q)<\infty$.

The next step in the proof is to choose the subset of $\mathcal{H}_m$ which is
"well-separated": this can be done due to the following fact (see Tsybakov [13],
Lemma 2.9):

Proposition 3.1 (Gilbert-Varshamov). For $m\ge8$, there exists

$\{\sigma_0,\ldots,\sigma_M\}\subset\{-1,1\}^m$

such that $\sigma_0=\{1,1,\ldots,1\}$, $\rho(\sigma_i,\sigma_j)\ge\frac m8$ $\forall\ 0\le i<j\le M$ and $M\ge2^{m/8}$,
where $\rho$ stands for the Hamming distance.
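
For small $m$ a well-separated family of sign vectors can be found by brute force, which gives a concrete picture of what the proposition provides. The greedy search below is only an illustration and is not the argument behind the Gilbert-Varshamov bound; the function names are hypothetical, and the search enumerates all $2^m$ vectors, so it is feasible only for small $m$.

```python
from itertools import product


def hamming(a, b):
    """Number of coordinates in which the sign vectors a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def greedy_separated_subset(m, min_dist):
    """Greedily collect vectors of {-1, 1}^m with pairwise Hamming distance >= min_dist,
    starting from the all-ones vector. Exhaustive over 2**m vectors, so only for small m."""
    chosen = [(1,) * m]
    for cand in product((1, -1), repeat=m):
        if all(hamming(cand, c) >= min_dist for c in chosen):
            chosen.append(cand)
    return chosen
```

Calling `greedy_separated_subset(8, 3)` returns a family that starts with the all-ones vector and whose members differ pairwise in at least three coordinates.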

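Let $\mathcal{H}':=\{P_{\sigma_0},\ldots,P_{\sigma_M}\}$ be chosen such that $\{\sigma_0,\ldots,\sigma_M\}$ satisfies the
proposition above. Next, following the proof of Theorems 1 and 3 in Castro
and Nowak [4], we note that $\forall\sigma\in\mathcal{H}'$, $\sigma\ne\sigma_0$

$\mathrm{KL}(P_{\sigma,N}\|P_{\sigma_0,N})\le8N\max_{x\in[0,1]^d}(\eta_\sigma(x)-\eta_{\sigma_0}(x))^2\le32C^2_{L,\beta}Nq^{-2\beta}$,    (3.5)

where $P_{\sigma,N}$ is the joint distribution of $(X_i,Y_i)_{i=1}^N$ under the hypothesis that the
distribution of the couple $(X,Y)$ is $P_\sigma$. Let us briefly sketch the derivation of
(3.5); see also the proof of Theorem 1 in Castro and Nowak [4]. Denote

$\bar X_k:=(X_1,\ldots,X_k)$,
$\bar Y_k:=(Y_1,\ldots,Y_k)$.

Then $dP_{\sigma,N}$ admits the following factorization:

$dP_{\sigma,N}(\bar X_N,\bar Y_N)=\prod_{i=1}^NP_\sigma(Y_i|X_i)\,dP(X_i|\bar X_{i-1},\bar Y_{i-1})$,

where $dP(X_i|\bar X_{i-1},\bar Y_{i-1})$ does not depend on $\sigma$ but only on the active
learning algorithm. As a consequence,

$\mathrm{KL}(P_{\sigma,N}\|P_{\sigma_0,N})=\mathbb{E}_{P_{\sigma,N}}\log\frac{dP_{\sigma,N}(\bar X_N,\bar Y_N)}{dP_{\sigma_0,N}(\bar X_N,\bar Y_N)}=\mathbb{E}_{P_{\sigma,N}}\log\frac{\prod_{i=1}^NP_\sigma(Y_i|X_i)}{\prod_{i=1}^NP_{\sigma_0}(Y_i|X_i)}=$

$=\sum_{i=1}^N\mathbb{E}_{P_{\sigma,N}}\left[\mathbb{E}_{P_\sigma}\left(\log\frac{P_\sigma(Y_i|X_i)}{P_{\sigma_0}(Y_i|X_i)}\,\Big|\,X_i\right)\right]\le$

$\le N\max_{x\in[0,1]^d}\mathbb{E}_{P_\sigma}\left(\log\frac{P_\sigma(Y_1|X_1)}{P_{\sigma_0}(Y_1|X_1)}\,\Big|\,X_1=x\right)\le$

$\le8N\max_{x\in[0,1]^d}(\eta_\sigma(x)-\eta_{\sigma_0}(x))^2$,

where the last inequality follows from Lemma 1, Castro and Nowak [4]. Also,
note that we have $\max_{x\in[0,1]^d}$ in our bounds rather than the average over $x$
that would appear in the passive learning framework.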
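
The last step bounds the Kullback-Leibler divergence between two label distributions by the squared difference of the regression functions. This can be checked numerically: with labels in $\{-1,+1\}$ and $\eta(x)=\mathbb{E}(Y|X=x)$ one has $\Pr(Y=1|X=x)=(1+\eta(x))/2$. The sketch below uses a hypothetical helper name and restricts attention to $|\eta|\le1/2$, an assumption made only for this illustration.

```python
import math


def label_kl(eta_a, eta_b):
    """KL divergence between the label laws on {-1, +1} with
    P(Y = 1 | x) = (1 + eta) / 2, for eta_a and eta_b strictly inside (-1, 1)."""
    p, q = (1 + eta_a) / 2, (1 + eta_b) / 2
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))
```

On a grid of values in $[-1/2,1/2]$ the divergence stays below $8(\eta_a-\eta_b)^2$, in line with the constant used in the display above.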

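It remains to choose $q,m$ in an appropriate way: set $q=\left\lfloor C_1N^{\frac{1}{2\beta+d-\beta\gamma}}\right\rfloor$ and
$m=\left\lfloor C_2q^{d-\beta\gamma}\right\rfloor$, where $C_1,C_2$ are such that $q^d\ge m\ge1$ and
$32C^2_{L,\beta}Nq^{-2\beta}<\frac{m}{64}$, which is possible for $N$ big enough. In particular,
$mq^{-d}=O(q^{-\beta\gamma})$. Together with the bound (3.5), this gives

$\frac1M\sum_{\sigma\in\mathcal{H}'}\mathrm{KL}(P_\sigma\|P_{\sigma_0})\le32C^2_{L,\beta}Nq^{-2\beta}<\frac{m}{64}=\frac18\log M$,

so that conditions of Theorem 3.2 are satisfied. Setting

$f_\sigma(x):=\mathrm{sign}\,\eta_\sigma(x)$,

we finally have $\forall\sigma_1\ne\sigma_2\in\mathcal{H}'$

$d(f_{\sigma_1},f_{\sigma_2}):=\Pi\left(\mathrm{sign}\,\eta_{\sigma_1}(x)\ne\mathrm{sign}\,\eta_{\sigma_2}(x)\right)\ge\frac{m}{8q^d}\ge C_4N^{-\frac{\beta\gamma}{2\beta+d-\beta\gamma}}$,

where the lower bound just follows by construction of our hypotheses. Since
under the low noise assumption $R_P(\hat f_n)-R^*\ge c\,\Pi(\hat f_n\ne\mathrm{sign}\,\eta)^{\frac{1+\gamma}{\gamma}}$ (see
(2.5)), we conclude that

$\inf_{\hat f_n}\sup_{P\in\mathcal{P}_U^*(\beta,\gamma)}\Pr\left(R_P(\hat f_n)-R^*\ge C_4N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}\right)\ge$

$\ge\inf_{\hat f_n}\sup_{P\in\mathcal{P}_U^*(\beta,\gamma)}\Pr\left(\Pi(\hat f_n(x)\ne\mathrm{sign}\,\eta_P(x))\ge\frac{C_4}{2}N^{-\frac{\beta\gamma}{2\beta+d-\beta\gamma}}\right)\ge\tau>0$.

□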
3.2. Upper bounds for the excess risk

Below, we present a new active learning algorithm which is computationally
tractable, adaptive with respect to $\beta,\gamma$ (in a certain range of these parameters)
and can be applied in the nonparametric setting. We show that the
classifier constructed by the algorithm attains the rates of Theorem 3.1, up
to a polylogarithmic factor, if $0<\beta\le1$ and $\beta\gamma\le d$ (the last condition covers
the most interesting case when the regression function hits or crosses the
decision boundary in the interior of the support of $\Pi$; for a detailed statement
about the connection between the behavior of the regression function near
the decision boundary and the parameters $\beta,\gamma$, see Proposition 3.4 in Audibert
and Tsybakov [1]). The problem of adaptation to higher order of smoothness
($\beta>1$) is still awaiting its complete solution; we address these questions
below in our final remarks.

For the purpose of this section, the regularity assumption reads as follows:
there exists $0<\beta\le1$ such that $\forall x_1,x_2\in[0,1]^d$

$|\eta(x_1)-\eta(x_2)|\le B_1\|x_1-x_2\|_\infty^{\beta}$.    (3.6)

Since we want to be able to construct non-asymptotic confidence bands,
some estimates on the size of constants in (3.6) and assumption 2 are needed.
Below, we will additionally assume that

$B_1\le\log N$,
$B_2\ge\log^{-1}N$,

where $N$ is the label budget. This can be replaced by any known bounds on
$B_1,B_2$.

Let $A\in\sigma(\mathcal{F}_m)$ with $A_\Pi:=A\cap\mathrm{supp}(\Pi)\ne\emptyset$. Define

$\hat\Pi_A(dx):=\Pi(dx|x\in A_\Pi)$

and $d_m:=\dim\mathcal{F}_m|_{A_\Pi}$. Next, we introduce a simple estimator of the regression
function on the set $A_\Pi$. Given the resolution level $m$ and an iid sample
$(X_i,Y_i)$, $i\le N$ with $X_i\sim\hat\Pi_A$, let

$\hat\eta_{m,A}(x):=\sum_{i:\,R_i\cap A_\Pi\ne\emptyset}\frac{\sum_{j=1}^NY_j\mathcal{I}_{R_i}(X_j)}{N\cdot\hat\Pi_A(R_i)}\,\mathcal{I}_{R_i}(x)$.    (3.7)

Since we assumed that the marginal $\Pi$ is known, the estimator is well-defined.
The following proposition provides the information about concentration
of $\hat\eta_{m,A}$ around its mean:

Proposition 3.2. For all $t>0$,

$\Pr\left(\max_{x\in A_\Pi}|\hat\eta_{m,A}(x)-\bar\eta_m(x)|\ge t\sqrt{\frac{2^{dm}\Pi(A)}{u_1N}}\right)\le2d_m\exp\left(\frac{-t^2}{2\left(1+\frac t3\sqrt{2^{dm}\Pi(A)/(u_1N)}\right)}\right)$.

Proof. This is a straightforward application of Bernstein's inequality to
the random variables

$S_N^i:=\sum_{j=1}^NY_j\mathcal{I}_{R_i}(X_j),\quad i\in\{i:\ R_i\cap A_\Pi\ne\emptyset\}$,

and the union bound: indeed, note that $\mathbb{E}(Y\mathcal{I}_{R_i}(X))^2=\hat\Pi_A(R_i)$, so that

$\Pr\left(\left|S_N^i-N\int_{R_i}\eta\,d\hat\Pi_A\right|\ge tN\hat\Pi_A(R_i)\right)\le2\exp\left(-\frac{N\hat\Pi_A(R_i)t^2}{2+2t/3}\right)$,

and the rest follows by simple algebra using that $\hat\Pi_A(R_i)\ge\frac{u_1}{2^{dm}\Pi(A)}$ by the
$(u_1,u_2)$-regularity of $\Pi$. □
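
The estimator (3.7) is straightforward to compute. The following one-dimensional sketch is an illustration under stated assumptions: the domain is $[0,1]$, cells are indexed from left to right, the masses $\hat\Pi_A(R_i)$ are passed in as known quantities (as the text assumes), and the function and argument names are hypothetical.

```python
def cell_estimator(xs, ys, m, active, cell_mass):
    """Piecewise-constant estimator in the spirit of (3.7) on [0, 1]:
    for each active dyadic cell R_i, sum the labels Y_j with X_j in R_i and
    divide by N * Pi_A(R_i), where cell_mass[i] = Pi_A(R_i) is assumed known."""
    cells = 2 ** m
    n = len(xs)
    sums = {i: 0.0 for i in active}
    for x, y in zip(xs, ys):
        i = min(int(x * cells), cells - 1)
        if i in sums:
            sums[i] += y
    return {i: sums[i] / (n * cell_mass[i]) for i in active}
```

Note that the normalization uses the known mass of each cell rather than the observed number of points in it, so individual values may fall slightly outside $[-1,1]$.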

Given a sequence of hypotheses classes $\mathcal{G}_m$, $m\ge1$, define the index set

$\mathcal{J}(N):=\left\{m\in\mathbb{N}:\ 1\le\dim\mathcal{G}_m\le\frac{N}{\log^2N}\right\}$.    (3.8)

This is the set of possible "resolution levels" of an estimator based on $N$ classified
observations (an upper bound corresponds to the fact that we want the
estimator to be consistent). When talking about model selection procedures
below, we will implicitly assume that the model index is chosen from the
corresponding set $\mathcal{J}$. The role of $\mathcal{G}_m$ will be played by $\mathcal{F}_m|_A$ for an appropriately
chosen set $A$. We are now ready to present the active learning algorithm
followed by its detailed analysis (see Table 1).

Algorithm 1a

input label budget $N$; confidence $\alpha$;
$\hat m_0=0$, $\hat{\mathcal{F}}_0:=\mathcal{F}_{\hat m_0}$, $\hat\eta_0\equiv0$;
$LB:=N$;    // label budget
$N_0:=2^{\lfloor\log_2\sqrt N\rfloor}$;
$s^{(k)}(m,N,\alpha):=s(m,N,\alpha):=m(\log N+\log\frac1\alpha)$;
$k:=0$;
while $LB\ge0$ do
    $k:=k+1$;
    $N_k:=2N_{k-1}$;
    $\hat A_k:=\left\{x\in[0,1]^d:\ \exists f_1,f_2\in\hat{\mathcal{F}}_{k-1},\ \mathrm{sign}(f_1(x))\ne\mathrm{sign}(f_2(x))\right\}$;
    if $\hat A_k\cap\mathrm{supp}(\Pi)=\emptyset$ or $LB<\lfloor N_k\cdot\Pi(\hat A_k)\rfloor$ then
        break; output $\hat g:=\mathrm{sign}\,\hat\eta_{k-1}$
    else
        for $i=1\ldots\lfloor N_k\cdot\Pi(\hat A_k)\rfloor$
            sample i.i.d. $(X_i^{(k)},Y_i^{(k)})$ with $X_i^{(k)}\sim\hat\Pi_k:=\Pi(dx|x\in\hat A_k)$;
        end for;
        $LB:=LB-\lfloor N_k\cdot\Pi(\hat A_k)\rfloor$;
        $\hat P_k:=\frac{1}{\lfloor N_k\cdot\Pi(\hat A_k)\rfloor}\sum_i\delta_{X_i^{(k)},Y_i^{(k)}}$    // "active" empirical measure
        $\hat m_k:=\mathrm{argmin}_{m\ge\hat m_{k-1}}\left[\inf_{f\in\mathcal{F}_m}\hat P_k(Y-f(X))^2+K_1\frac{2^{dm}\Pi(\hat A_k)+s(m-\hat m_{k-1},N,\alpha)}{\lfloor N_k\cdot\Pi(\hat A_k)\rfloor}\right]$
        $\hat\eta_k:=\hat\eta_{\hat m_k,\hat A_k}$    // see (3.7)
        $\hat{\mathcal{F}}_k:=\left\{f\in\mathcal{F}_{\hat m_k}:\ f|_{\hat A_k}\in\mathcal{F}_{\infty,\hat A_k}(\hat\eta_k;\delta_k),\ f|_{[0,1]^d\setminus\hat A_k}\equiv\hat\eta_{k-1}|_{[0,1]^d\setminus\hat A_k}\right\}$;
end;

Table 1
Active Learning Algorithm

Remark. Note that on every iteration, Algorithm 1a uses the whole
sample to select the resolution level $\hat m_k$ and to build the estimator $\hat\eta_k$. While
being suitable for practical implementation, this is not convenient for theoretical
analysis. We will prove the upper bounds for a slightly modified version:
namely, on every iteration $k$ labeled data is divided into two subsamples
$S_{k,1}$ and $S_{k,2}$ of approximately equal size,
$|S_{k,1}|\simeq|S_{k,2}|\simeq\left\lfloor\frac12N_k\cdot\Pi(\hat A_k)\right\rfloor$.
Then $S_{k,1}$ is used to select the resolution level $\hat m_k$ and $S_{k,2}$ to construct $\hat\eta_k$.
We will call this modified version Algorithm 1b.
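
The overall structure of the loop in Table 1 can be illustrated by a heavily simplified simulation. The sketch below is not the algorithm analyzed in the paper: it assumes a uniform marginal on $[0,1]$, keeps the resolution level `m` fixed instead of performing model selection, and uses an ad hoc band half-width controlled by the hypothetical parameter `delta0`. All names are illustrative.

```python
import math
import random


def run_active_sketch(eta, budget, m, delta0=1.0, seed=0):
    """Simplified one-dimensional analogue of the loop in Table 1.

    Assumptions (not from the paper): uniform marginal on [0, 1], a fixed
    resolution level m instead of model selection, and a band half-width
    delta0 / sqrt(N_k * 2**-m) chosen ad hoc.
    Returns the cell-wise estimates and the number of labels used."""
    rng = random.Random(seed)
    cells = 2 ** m
    est = [0.0] * cells                 # eta_hat_0 = 0 on every cell
    active = list(range(cells))         # initially every cell is active
    n_k = 2 ** int(math.log2(math.sqrt(budget)))
    used = 0
    while active:
        n_k *= 2
        n_req = int(n_k * len(active) / cells)   # floor(N_k * Pi(A_k))
        if n_req == 0 or used + n_req > budget:
            break
        sums = {i: 0.0 for i in active}
        for _ in range(n_req):
            i = rng.choice(active)               # X ~ Pi restricted to A_k
            x = (i + rng.random()) / cells
            y = 1 if rng.random() < (1 + eta(x)) / 2 else -1
            sums[i] += y
        used += n_req
        mass = 1.0 / len(active)                 # Pi_A(R_i) for active cells
        for i in active:
            est[i] = sums[i] / (n_req * mass)
        half_width = delta0 / math.sqrt(n_k / cells)
        active = [i for i in active if abs(est[i]) <= half_width]
    return est, used
```

The resulting plug-in classifier is the sign of `est` on each cell. On cells that left the active set, the estimate from the last step at which they were active is kept, mirroring the constraint that functions in the band coincide with the previous estimator outside the active set.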
As a first step towards the analysis of Algorithm 1b, let us prove a
useful fact about the general model selection scheme. Given an iid sample
$(X_i,Y_i)$, $i\le N$, set $s_m=m(s+\log\log_2N)$, $m\ge1$ and

$\hat m:=\hat m(s)=\mathrm{argmin}_{m\in\mathcal{J}(N)}\left[\inf_{f\in\mathcal{F}_m}P_N(Y-f(X))^2+K_1\frac{2^{dm}+s_m}{N}\right]$,    (3.9)

$\bar m:=\min\left\{m\ge1:\ \inf_{f\in\mathcal{F}_m}\mathbb{E}(f(X)-\eta(X))^2\le K_2\frac{2^{dm}}{N}\right\}$.    (3.10)

Theorem 3.3. There exists an absolute constant $K_1$ big enough such that,
with probability $\ge1-e^{-s}$,

$\hat m\le\bar m$.

Proof. See Appendix B. □

Straightforward application of this result immediately yields the following:

Corollary 3.1. Suppose $\eta(x)\in\Sigma(\beta,L,[0,1]^d)$. Then, with probability
$\ge1-e^{-s}$,

$2^{\hat m}\le C_1\cdot N^{\frac{1}{2\beta+d}}$.

Proof. By definition of $\bar m$, we have

$\bar m\le1+\max\left\{m:\ \inf_{f\in\mathcal{F}_m}\mathbb{E}(f(X)-\eta(X))^2>K_2\frac{2^{dm}}{N}\right\}\le1+\max\left\{m:\ L^22^{-2\beta m}>K_2\frac{2^{dm}}{N}\right\}$,

and the claim follows. □
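
The last maximum in this proof can be computed directly, which makes the resulting bound on $2^{m}$ easy to inspect. The helper below is a hypothetical illustration; the default constants `lip` and `k2` stand in for $L$ and $K_2$ and are placeholders, not values from the paper.

```python
def largest_resolution(n, beta, d, lip=1.0, k2=1.0):
    """Largest integer m >= 1 with lip**2 * 2**(-2*beta*m) > k2 * 2**(d*m) / n,
    the inequality appearing in the proof of Corollary 3.1.
    Returns 0 if no such m exists (constants are placeholders)."""
    m = 0
    while lip ** 2 * 2 ** (-2 * beta * (m + 1)) > k2 * 2 ** (d * (m + 1)) / n:
        m += 1
    return m
```

Rearranging the inequality gives $2^{m(2\beta+d)}<L^2N/K_2$, so the returned level satisfies $2^m<(L^2N/K_2)^{\frac{1}{2\beta+d}}$.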

With this bound in hand, we are ready to formulate and prove the main
result of this section:

Theorem 3.4. Suppose that $P\in\mathcal{P}_U^*(\beta,\gamma)$ with $B_1\le\log N$, $B_2\ge\log^{-1}N$
and $\beta\gamma\le d$. Then, with probability $\ge1-3\alpha$, the classifier $\hat g$ returned by
Algorithm 1b with label budget $N$ satisfies

$R_P(\hat g)-R^*\le\mathrm{Const}\cdot N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}\log^p\frac N\alpha$,

where $p\le\frac{2\beta\gamma(1+\gamma)}{2\beta+d-\beta\gamma}$ and $B_1,B_2$ are the constants from (3.6) and
assumption 2.
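
To compare the exponent $\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}$ of the active rate with the passive plug-in exponent $\frac{\beta(1+\gamma)}{2\beta+d}$ quoted in the introduction, the two can simply be tabulated. The helper names below are hypothetical. A short calculation shows that the active exponent exceeds $\frac12$ exactly when $3\beta\gamma>d$, while the passive one does so exactly when $2\beta\gamma>d$.

```python
def active_exponent(beta, gamma, d):
    """Exponent beta(1+gamma)/(2*beta + d - beta*gamma) of the active rate."""
    return beta * (1 + gamma) / (2 * beta + d - beta * gamma)


def passive_exponent(beta, gamma, d):
    """Exponent beta(1+gamma)/(2*beta + d) of the passive plug-in rate."""
    return beta * (1 + gamma) / (2 * beta + d)
```

For instance, with $\beta=\gamma=1$ and $d=2$ the active exponent is $2/3$ while the passive one is $1/2$.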

Remarks

1. Note that when $\beta\gamma>\frac d3$, $N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}$ is a fast rate, i.e., faster than
$N^{-\frac12}$; at the same time, the passive learning rate $N^{-\frac{\beta(1+\gamma)}{2\beta+d}}$ is guaranteed to
be fast only when $\beta\gamma>\frac d2$, see Audibert and Tsybakov [1].

2. For $\hat\alpha\simeq N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}$, Algorithm 1b returns a classifier $\hat g_{\hat\alpha}$ that satisfies

$\mathbb{E}R_P(\hat g_{\hat\alpha})-R^*\le\mathrm{Const}\cdot N^{-\frac{\beta(1+\gamma)}{2\beta+d-\beta\gamma}}\log^pN$.

This is a direct corollary of Theorem 3.4 and the inequality
$\mathbb{E}|Z|\le t+\|Z\|_\infty\Pr(|Z|\ge t)$.

Proof. Our main goal is to construct high probability bounds for the size of
the active sets defined by Algorithm 1b. In turn, these bounds depend on
the size of the confidence bands for $\eta(x)$, and the previous result (Theorem
3.3) is used to obtain the required estimates. Suppose $L$ is the number of
steps performed by the algorithm before termination; clearly, $L\le N$.
Let $N_k^{act}:=\lfloor N_k\cdot\Pi(\hat A_k)\rfloor$ be the number of labels requested on the $k$-th step of
the algorithm: this choice guarantees that the "density" of labeled examples
doubles on every step.

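Claim: the following bound for the size of the active set holds uniformly for
all $2\le k\le L$ with probability at least $1-2\alpha$:

$\Pi(\hat A_k)\le CN_k^{-\frac{\beta\gamma}{2\beta+d}}\left(\log\frac N\alpha\right)^{2\gamma}$.    (3.11)

It is not hard to finish the proof assuming (3.11) is true: indeed, it implies
that the number of labels requested on step $k$ satisfies

$N_k^{act}=\lfloor N_k\Pi(\hat A_k)\rfloor\le C\cdot N_k^{\frac{2\beta+d-\beta\gamma}{2\beta+d}}\left(\log\frac N\alpha\right)^{2\gamma}$

with probability $\ge1-2\alpha$. Since $\sum_kN_k^{act}\le N$, one easily deduces that on
the last iteration $L$ we have

$N_L\ge c\left(\frac{N}{\log^{2\gamma}(N/\alpha)}\right)^{\frac{2\beta+d}{2\beta+d-\beta\gamma}}$.    (3.12)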

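To obtain the risk bound of the theorem from here, we apply inequality (2.3)$^2$
from proposition 2.1:

$R_P(\hat g)-R^*\le D_1\left\|(\hat\eta_L-\eta)\cdot\mathcal{I}\{\mathrm{sign}\,\hat\eta_L\ne\mathrm{sign}\,\eta\}\right\|_\infty^{1+\gamma}$.    (3.13)

It remains to estimate $\|\hat\eta_L-\eta\|_{\infty,\hat A_L}$: we will show below while proving (3.11)
that

$\|\hat\eta_L-\eta\|_{\infty,\hat A_L}\le C\cdot N_L^{-\frac{\beta}{2\beta+d}}\log\frac N\alpha$.

Together with (3.12) and (3.13), it implies the final result.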

To finish the proof, it remains to establish (3.11). Recall that f) k stands 
for the £2(11) - projection of rj onto Tm k - An important role in the argu- 
ment is played by the bound on the ^(Ilfc) - norm of the "bias" [f\ k — rj): 



² Alternatively, inequality (2.4) can be used, but it results in a slightly inferior logarithmic
factor.

together with assumption 2, it allows us to estimate $\|\bar\eta_k - \eta\|_{\infty, A_k}$. The required
bound follows from the following oracle inequality: there exists an event $\mathcal{B}$
of probability $\ge 1 - \alpha$ such that on this event for every $1 \le k \le L$



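$$\|\bar\eta_k - \eta\|^2_{L_2(\hat\Pi_k)} \le \inf_{m \ge \hat m_{k-1}} \left[ \inf_{f \in \mathcal{F}_m} \|f - \eta\|^2_{L_2(\hat\Pi_k)} + K_1 \frac{2^{dm}\, \Pi(A_k) + (m - \hat m_{k-1}) \log(N/\alpha)}{N_k \Pi(A_k)} \right]. \tag{3.14}$$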

In its general form, this inequality is given by Theorem 6.1 in Koltchinskii [10]
and provides the estimate for $\|\hat\eta_k - \eta\|_{L_2(\hat\Pi_k)}$, so it automatically implies the
weaker bound for the bias term only. To deduce (3.14), we use the mentioned
general inequality $L$ times (once for every iteration) and the union bound.
The quantity $2^{dm}\, \Pi(A_k)$ in (3.14) plays the role of the dimension, which is
justified below. Let $k \ge 1$ be fixed. For $m \ge \hat m_{k-1}$, consider hypothesis
classes

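An obvious but important fact is that for $P \in \mathcal{P}_U(\beta, \gamma)$, the dimension of
$\mathcal{F}_m|_{A_k}$ is bounded by $u_1^{-1} \cdot 2^{dm}\, \Pi(A_k)$: indeed,

$$\Pi(A_k) = \Pi\Big(\bigcup_{j:\, R_j \cap A_k \ne \emptyset} R_j\Big) \ge u_1 2^{-dm} \cdot \#\{j : R_j \cap A_k \ne \emptyset\},$$

hence

$$\dim \mathcal{F}_m|_{A_k} = \#\{j : R_j \cap A_k \ne \emptyset\} \le u_1^{-1} \cdot 2^{dm}\, \Pi(A_k). \tag{3.15}$$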

Theorem 3.3 applies conditionally on the sample collected on steps $j \le k-1$, with sample of
size $N_k^{act}$ and $s = \log(N/\alpha)$: to apply the theorem, note that, by definition
of $A_k$, it is independent of $X_i^{(k)}$, $i = 1 \ldots N_k^{act}$. Arguing as in Corollary
3.1 and using (3.15), we conclude that the following inequality holds with
probability $\ge 1 - \alpha/N$ for every fixed $k$:

$$2^{\hat m_k} \le C \cdot N_k^{\frac{1}{2\beta+d}}. \tag{3.16}$$

Let $\mathcal{E}_1$ be an event of probability $\ge 1 - \alpha$ such that on this event bound
(3.16) holds for every step $k$, $k \le L$, and let $\mathcal{E}_2$ be an event of probability
$\ge 1 - \alpha$ on which inequalities (3.14) are satisfied. Suppose that event $\mathcal{E}_1 \cap \mathcal{E}_2$
occurs and let $k_0$ be a fixed arbitrary integer, $2 \le k_0 \le L + 1$. It is enough
to assume that $A_{k_0-1}$ is nonempty (otherwise, the bound trivially holds), so
that it contains at least one cube with sidelength $2^{-\hat m_{k_0-2}}$ and

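$$\Pi(A_{k_0-1}) \ge u_1 2^{-d \hat m_{k_0-2}} \ge c N_{k_0-1}^{-\frac{d}{2\beta+d}}. \tag{3.17}$$

Consider inequality (3.14) with $k = k_0 - 1$ and $2^m \simeq N_{k_0-1}^{\frac{1}{2\beta+d}}$. By (3.17), we
have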

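For convenience and brevity, denote $\Omega := \operatorname{supp}(\Pi)$. Now assumption 2 comes
into play: it implies, together with (3.18), that

$$C N_{k_0-1}^{-\frac{\beta}{2\beta+d}} \log\frac{N}{\alpha} \ge \|\bar\eta_{k_0-1} - \eta\|_{L_2(\hat\Pi_{k_0-1})} \ge B_2 \|\bar\eta_{k_0-1} - \eta\|_{\infty, \Omega \cap A_{k_0-1}}. \tag{3.19}$$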
To bound 

we apply Proposition 3.2. Recall that $\hat m_{k_0-1}$ depends only on the subsample
$S_{k_0-1,1}$ but not on $S_{k_0-1,2}$. Let

n-Ux®^®}^ ,j<k-l; S k>1 

be the random vector that defines A k and resolution level Note that 
E(% _i(a;)|7fe --i) = % fco _i(a;) Vx a.s. 
Proposition 3.2 thus implies 



drh k i 



/ / 2 amfc o 
Pr max |r/ feo _i(:c) - r/ A (x)| > KU — 



< iVexp 



-t 



% -x j < 

2 



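Choosing $t = c\log(N/\alpha)$ and taking expectation, the inequality (now unconditional)
becomes

$$\Pr\left( \max_{x \in \Omega \cap A_{k_0-1}} |\hat\eta_{k_0-1}(x) - \bar\eta_{k_0-1}(x)| \le K \sqrt{\frac{2^{d \hat m_{k_0-1}} \log^2(N/\alpha)}{N_{k_0-1}}} \right) \ge 1 - \alpha. \tag{3.20}$$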
Let $\mathcal{E}_3$ be the event on which (3.20) holds true. Combined, the estimates
(3.16), (3.19) and (3.20) imply that on $\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3$

$$\|\eta - \hat\eta_{k_0-1}\|_{\infty, \Omega \cap A_{k_0-1}} \le \|\eta - \bar\eta_{k_0-1}\|_{\infty, \Omega \cap A_{k_0-1}} + \|\bar\eta_{k_0-1} - \hat\eta_{k_0-1}\|_{\infty, \Omega \cap A_{k_0-1}} \le (K + C) \cdot N_{k_0-1}^{-\frac{\beta}{2\beta+d}} \log^2\frac{N}{\alpha}, \tag{3.21}$$

where we used the assumption $B_2 \ge \log^{-1} N$. Now the width of the confidence
band is defined via

$$\delta_k := 2(K + C) \cdot N_k^{-\frac{\beta}{2\beta+d}} \log^2\frac{N}{\alpha} \tag{3.22}$$

(in particular, $D$ from Algorithm 1a is equal to $2(K + C)$). With the
bound (3.21) available, it is straightforward to finish the proof of the claim. 
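Indeed, by (3.22) and the definition of the active set, the necessary condition
for $x \in \Omega \cap A_{k_0}$ is

$$|\eta(x)| \le 3(K + C) \cdot N_{k_0-1}^{-\frac{\beta}{2\beta+d}} \log^2\frac{N}{\alpha},$$

so that

$$\Pi(A_{k_0}) = \Pi(\Omega \cap A_{k_0}) \le \Pi\left( |\eta(x)| \le 3(K + C) \cdot N_{k_0-1}^{-\frac{\beta}{2\beta+d}} \log^2\frac{N}{\alpha} \right) \le \tilde B N_{k_0}^{-\frac{\beta\gamma}{2\beta+d}} \log^{2\gamma}\frac{N}{\alpha}$$

by the low noise assumption. This completes the proof of the claim since
$\Pr(\mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3) \ge 1 - 3\alpha$. □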

We conclude this section by discussing the running time of the active learning
algorithm. Assume that the algorithm has access to the sampling subroutine
that, given $A \subset [0,1]^d$ with $\Pi(A) > 0$, generates i.i.d. $(X_i, Y_i)$ with $X_i \sim \Pi(dx \mid x \in A)$.

Proposition 3.3. The running time of Algorithm 1a (1b) with label budget
$N$ is

$$O(dN \log^2 N).$$
Remark In view of Theorem 3.4, the running time required to output
a classifier $\hat g$ such that $R_P(\hat g) - R^* \le \varepsilon$ with probability $\ge 1 - \alpha$ is

poly ^log V 

Proof. We will use the notations of Theorem 3.4. Let $N_k^{act}$ be the number
of labels requested by the algorithm on step $k$. The resolution level $\hat m_k$ is
always chosen such that $A_k$ is partitioned into at most $N_k^{act}$ dyadic cubes,
see (3.8). This means that the estimator $\hat\eta_k$ takes at most $N_k^{act}$ distinct
values. The key observation is that for any $k$, the active set $A_{k+1}$ is always
represented as the union of a finite number (at most $N_k^{act}$) of dyadic cubes:
to determine if a cube $R_j \subset A_{k+1}$, it is enough to take a point $x \in R_j$ and
compare $\operatorname{sign}(\hat\eta_k(x) - \delta_k)$ with $\operatorname{sign}(\hat\eta_k(x) + \delta_k)$: $R_j \subset A_{k+1}$ only if the signs
are different (so that the confidence band crosses zero level). This can be
done in $O(N_k^{act})$ steps.
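
The sign comparison described above can be sketched in a few lines. This is only an illustration, not the paper's implementation: the function name `next_active_cubes`, the encoding of a dyadic cube as a tuple of integer grid indices, and the dictionary holding one estimate per cube are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

# Hypothetical encoding: a dyadic cube is identified by its integer grid index.
Cube = Tuple[int, ...]


def sign(v: float) -> int:
    """Return -1, 0 or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def next_active_cubes(estimates: Dict[Cube, float], delta: float) -> List[Cube]:
    """Keep the cubes on which the band [eta_hat - delta, eta_hat + delta] crosses zero.

    `estimates` maps each cube of the current partition to the (constant)
    value of the estimator on that cube, so one lookup per cube suffices.
    """
    active = []
    for cube, eta_hat in estimates.items():
        if sign(eta_hat - delta) != sign(eta_hat + delta):
            active.append(cube)
    return active
```

The loop touches every cube once, which matches the linear cost in the number of cubes stated in the proof.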

Next, resolution level $\hat m_k$ can be found in $O(N_k^{act} \log^2 N)$ steps: there are
at most $\log_2 N_k^{act}$ models to consider; for each $m$, $\inf_{f \in \mathcal{F}_m} \hat P_k (Y - f(X))^2$ is
found explicitly and is achieved for the piecewise-constant

$$\hat f(x) = \frac{\sum_i Y_i^{(k)} I_{R_j}(X_i^{(k)})}{\sum_i I_{R_j}(X_i^{(k)})}, \quad x \in R_j.$$

Sorting of the data required for this computation is done in $O(d N_k^{act} \log N)$
steps for each $m$, so the whole $k$-th iteration running time is $O(d N_k^{act} \log^2 N)$.
Since $\sum_k N_k^{act} \le N$, the result follows. □



4. Conclusion and open problems 



We have shown that active learning can significantly improve the quality of
a classifier over the passive algorithm for a large class of underlying distributions.
The presented method achieves fast rates of convergence for the excess risk;
moreover, it is adaptive (in a certain range of smoothness and noise parameters)
and involves minimization only with respect to quadratic loss (rather
than the 0-1 loss).
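
To get a feel for the size of the improvement, one can compare rate exponents numerically. The sketch below is illustrative only: it assumes the active rate has exponent $\beta(1+\gamma)/(2\beta+d-\beta\gamma)$, as in the corollary to Theorem 3.4, and it takes $\beta(1+\gamma)/(2\beta+d)$ as the passive benchmark, which is the exponent commonly quoted for passive plug-in classifiers under comparable assumptions and is an assumption of this example. Logarithmic factors are ignored and the function names are hypothetical.

```python
def active_exponent(beta: float, gamma: float, d: int) -> float:
    """Exponent a in the rate N**(-a) assumed for the active learner."""
    denom = 2 * beta + d - beta * gamma
    if denom <= 0:
        raise ValueError("need 2*beta + d - beta*gamma > 0")
    return beta * (1 + gamma) / denom


def passive_exponent(beta: float, gamma: float, d: int) -> float:
    """Exponent a in the rate N**(-a) assumed for the passive benchmark."""
    return beta * (1 + gamma) / (2 * beta + d)


if __name__ == "__main__":
    for beta, gamma, d in [(1.0, 1.0, 2), (0.5, 1.0, 3), (1.0, 0.5, 5)]:
        a = active_exponent(beta, gamma, d)
        p = passive_exponent(beta, gamma, d)
        print(f"beta={beta}, gamma={gamma}, d={d}: active {a:.3f}, passive {p:.3f}")
```

For positive $\beta$ and $\gamma$ the active exponent is strictly larger, and the gap widens as $\beta\gamma$ approaches $2\beta + d$.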

The natural question related to our results is:

• Can we implement adaptive smooth estimators in the learning algorithm
to extend our results beyond the case $\beta \le 1$?

The answer to this question is so far an open problem. Our conjecture
is that the correct rate of convergence for the excess risk is $N^{-\frac{\beta(1+\gamma)}{2\beta+d-\gamma(\beta \wedge 1)}}$, up
to logarithmic factors, which coincides with the presented results for $\beta \le 1$. This
rate can be derived from an argument similar to the proof of Theorem 3.4
under the assumption that on every step $k$ one could construct an estimator
$\hat\eta_k$ with



At the same time, the active set associated to $\hat\eta_k$ should maintain some structure
which is suitable for the iterative nature of the algorithm. Transforming
these ideas into a rigorous proof is a goal of our future work.
Appendix A: Functions satisfying assumption 2

In the propositions below, we will assume for simplicity that the marginal
distribution $\Pi$ is absolutely continuous with respect to Lebesgue measure
with density $p(x)$ such that

$$0 < p_1 \le p(x) \le p_2 < \infty \quad \text{for all } x \in [0,1]^d. \tag{A.1}$$

Given $t \in (0, 1]$, define $A_t := \{x : |\eta(x)| \le t\}$.

Proposition A.1. Suppose $\eta$ is Lipschitz continuous with Lipschitz constant
$S$. Assume also that for some $t_* > 0$ we have

(a) $\Pi(A_{t_*/3}) > 0$;

(b) $\eta$ is twice differentiable for all $x \in A_{t_*}$;

(c) $\inf_{x \in A_{t_*}} \|\nabla\eta(x)\|_1 \ge s > 0$;

(d) $\sup_{x \in A_{t_*}} \|D^2\eta(x)\| \le C < \infty$, where $\|\cdot\|$ is the operator norm.

Then $\eta$ satisfies assumption 2.

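Proof. By the intermediate value theorem, for any cube $R_i$, $1 \le i \le 2^{dm}$, there
exists $x_0 \in R_i$ such that $\bar\eta_m(x) = \eta(x_0)$, $x \in R_i$. This implies

$$|\eta(x) - \bar\eta_m(x)| = |\eta(x) - \eta(x_0)| = |\nabla\eta(\xi) \cdot (x - x_0)| \le \|\nabla\eta(\xi)\|_1 \|x - x_0\|_\infty \le S \cdot 2^{-m}.$$

On the other hand, if $R_i \subset A_{t_*}$ then

$$|\eta(x) - \bar\eta_m(x)| = |\eta(x) - \eta(x_0)| = \left| \nabla\eta(x_0) \cdot (x - x_0) + \tfrac12 [D^2\eta(\xi)](x - x_0) \cdot (x - x_0) \right|$$
$$\ge |\nabla\eta(x_0) \cdot (x - x_0)| - \tfrac12 \sup_\xi \|D^2\eta(\xi)\| \max_{x \in R_i} \|x - x_0\|_2^2 \ge |\nabla\eta(x_0) \cdot (x - x_0)| - C_1 2^{-2m}. \tag{A.2}$$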

Note that a strictly positive continuous function

$$h(y, u) = \int_{[0,1]^d} (u \cdot (x - y))^2 \, dx$$

achieves its minimal value $h_* > 0$ on a compact set $[0,1]^d \times \{u \in \mathbb{R}^d : \|u\|_1 = 1\}$.
This implies (using (A.2) and the inequality $(a - b)^2 \ge \frac{a^2}{2} - b^2$)

$$\Pi^{-1}(R_i) \int_{R_i} (\eta(x) - \bar\eta_m(x))^2 p(x) \, dx \ge \frac12 (p_2 2^{-dm})^{-1} p_1 \int_{R_i} (\nabla\eta(x_0) \cdot (x - x_0))^2 \, dx - C_1^2 2^{-4m}$$
$$\ge \frac{p_1}{2 p_2} \|\nabla\eta(x_0)\|_1^2 2^{-2m} \cdot h_* - C_1^2 2^{-4m} \ge c_2 2^{-2m} \quad \text{for } m \ge m_0.$$

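Now take a set $A \in \sigma(\mathcal{F}_m)$, $m \ge m_0$, from assumption 2. There are 2
possibilities: either $A \subset A_{t_*}$ or $A \supseteq A_{t_*/3}$. In the first case the computation
above implies

$$\int_{[0,1]^d} (\eta - \bar\eta_m)^2 \, \Pi(dx \mid x \in A) \ge c_2 2^{-2m} = \frac{c_2}{S^2} S^2 2^{-2m} \ge \frac{c_2}{S^2} \|\eta - \bar\eta_m\|^2_{\infty, A}.$$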



S. Minsker/ 'Plug-in Approach 25 

If the second case occurs, note that, since {x : < \rj(x)\ < y } has nonempty 
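interior, it must contain a dyadic cube $R_*$ with edge length $2^{-m_*}$. Then for
any $m \ge \max(m_0, m_*)$

$$\int_{[0,1]^d} (\eta - \bar\eta_m)^2 \, \Pi(dx \mid x \in A) \ge \Pi^{-1}(A) \int_{R_*} (\eta - \bar\eta_m)^2 \, \Pi(dx) \ge c_2 2^{-2m}\, \Pi(R_*) \ge \frac{c_2}{S^2} \Pi(R_*) \|\eta - \bar\eta_m\|^2_{\infty, A}$$

and the claim follows. □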

The next proposition describes conditions which allow functions to have
vanishing gradient on the decision boundary but requires convexity and regular
behaviour of the gradient.

Everywhere below, $\nabla\eta$ denotes the subgradient of a convex function $\eta$.

For $0 < t_1 < t_2$, define

$$G(t_1, t_2) := \frac{\sup_{x \in A_{t_2} \setminus A_{t_1}} \|\nabla\eta(x)\|_1}{\inf_{x \in A_{t_2} \setminus A_{t_1}} \|\nabla\eta(x)\|_1}.$$

In case when $\nabla\eta(x)$ is not unique, we choose a representative that makes $G(t_1, t_2)$ as small as
possible.

Proposition A.2. Suppose $\eta(x)$ is Lipschitz continuous with Lipschitz constant
$S$. Moreover, assume that there exist $t_* > 0$ and $q : (0, \infty) \to (0, \infty)$
such that $A_{t_*} \subset (0,1)^d$ and

(a) $b_1 t^\gamma \le \Pi(A_t) \le b_2 t^\gamma$ $\forall t < t_*$;

(b) For all $0 < t_1 < t_2 \le t_*$, $G(t_1, t_2) \le q\left(\frac{t_2}{t_1}\right)$;

(c) Restriction of $\eta$ to any convex subset of $A_{t_*}$ is convex.

Then $\eta$ satisfies assumption 2.

Remark The statement remains valid if we replace $\eta$ by $|\eta|$ in (c).
Proof. Assume that for some $t \le t_*$ and $k > 0$

$$R \subset A_t \setminus A_{t/k}$$

is a dyadic cube with edge length $2^{-m}$ and let $x_0$ be such that $\bar\eta_m(x) = \eta(x_0)$, $x \in R$.
Note that $\eta$ is convex on $R$ due to (c). Using the subgradient
inequality $\eta(x) - \eta(x_0) \ge \nabla\eta(x_0) \cdot (x - x_0)$, we obtain

$$\int_R (\eta(x) - \eta(x_0))^2 \, d\Pi(x) \ge \int_R (\eta(x) - \eta(x_0))^2 \, I\{\nabla\eta(x_0) \cdot (x - x_0) \ge 0\} \, d\Pi(x)$$
$$\ge \int_R (\nabla\eta(x_0) \cdot (x - x_0))^2 \, I\{\nabla\eta(x_0) \cdot (x - x_0) \ge 0\} \, d\Pi(x). \tag{A.3}$$

The next step is to show that under our assumptions $x_0$ can be chosen such
that

$$\operatorname{dist}_\infty(x_0, \partial R) \ge \nu 2^{-m} \tag{A.4}$$

where $\nu = \nu(k)$ is independent of $m$. In this case any part of $R$ cut by
a hyperplane through $x_0$ contains half of a ball $B(x_0, r_0)$ of radius $r_0 = \nu(k) 2^{-m}$,
and the last integral in (A.3) can be further bounded below to get

$$\int_R (\eta(x) - \eta(x_0))^2 \, d\Pi(x) \ge \frac12 \int_{B(x_0, r_0)} (\nabla\eta(x_0) \cdot (x - x_0))^2 p_1 \, dx \ge c(k) \|\nabla\eta(x_0)\|_1^2 2^{-2m} 2^{-dm}. \tag{A.5}$$

It remains to show (A.4). Assume that for all $y$ such that $\eta(y) = \eta(x_0)$ we
have

$$\operatorname{dist}_\infty(y, \partial R) \le \delta 2^{-m}$$

for some $\delta > 0$. This implies that the boundary of the convex set

$$\{x \in R : \eta(x) \le \eta(x_0)\}$$

is contained in $R_\delta := \{x \in R : \operatorname{dist}_\infty(x, \partial R) \le \delta 2^{-m}\}$. There are two possibilities:
either $\{x \in R : \eta(x) \le \eta(x_0)\} \supseteq R \setminus R_\delta$ or $\{x \in R : \eta(x) \le \eta(x_0)\} \subset R_\delta$.

We consider the first case only (the proof in the second case is similar). First,
note that by (b) for all $x \in R_\delta$, $\|\nabla\eta(x)\|_1 \le q(k) \|\nabla\eta(x_0)\|_1$ and

$$\eta(x) \le \eta(x_0) + \|\nabla\eta(x)\|_1 \delta 2^{-m} \le \eta(x_0) + q(k) \|\nabla\eta(x_0)\|_1 \delta 2^{-m}. \tag{A.6}$$

Let $x_c$ be the center of the cube $R$ and $u$ the unit vector in direction
$\nabla\eta(x_c)$. Observe that

$$\eta(x_c + (1 - 3\delta) 2^{-m} u) - \eta(x_c) \ge \nabla\eta(x_c) \cdot (1 - 3\delta) 2^{-m} u = (1 - 3\delta) 2^{-m} \|\nabla\eta(x_c)\|_2.$$
On the other hand, $x_c + (1 - 3\delta) 2^{-m} u \in R \setminus R_\delta$ and
$$\eta(x_c + (1 - 3\delta) 2^{-m} u) \le \eta(x_0),$$
hence $\eta(x_c) \le \eta(x_0) - c(1 - 3\delta) 2^{-m} \|\nabla\eta(x_c)\|_1$. Consequently, for all

$$x \in B(x_c, \delta) := \left\{ x : \|x - x_c\|_\infty \le \tfrac12 c\, 2^{-m} (1 - 3\delta) \right\}$$

we have

$$\eta(x) \le \eta(x_c) + \|\nabla\eta(x_c)\|_1 \|x - x_c\|_\infty \le \eta(x_0) - \tfrac{c}{2} 2^{-m} (1 - 3\delta) \|\nabla\eta(x_c)\|_1. \tag{A.7}$$

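Finally, recall that $\eta(x_0)$ is the average value of $\eta$ on $R$. Together with
(A.6), (A.7) this gives

$$\Pi(R) \eta(x_0) = \int_R \eta(x) \, d\Pi = \int_{R_\delta} \eta(x) \, d\Pi + \int_{R \setminus R_\delta} \eta(x) \, d\Pi$$
$$\le \left( \eta(x_0) + q(k) \|\nabla\eta(x_0)\|_1 \delta 2^{-m} \right) \Pi(R_\delta) + \left( \eta(x_0) - \tfrac{c}{2} 2^{-m} (1 - 3\delta) \|\nabla\eta(x_0)\|_1 \right) \Pi(B(x_c, \delta)) + \eta(x_0)\, \Pi(R \setminus (R_\delta \cup B(x_c, \delta)))$$
$$= \Pi(R) \eta(x_0) + q(k) \|\nabla\eta(x_0)\|_1 \delta 2^{-m}\, \Pi(R_\delta) - \tfrac{c}{2} 2^{-m} (1 - 3\delta) \|\nabla\eta(x_0)\|_1 \Pi(B(x_c, \delta)).$$

Since $\Pi(R_\delta) \le p_2 2^{-dm}$ and $\Pi(B(x_c, \delta)) \ge c_3 2^{-dm} (1 - 3\delta)^d$, the inequality
above implies

$$c_4 q(k) \delta \ge (1 - 3\delta)^{d+1}$$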

which is impossible for small <5(e.g., for 5 < g , fc wg d+4 A 

Let $A$ be a set from condition 2. If $A \supseteq A_{t_*/3}$, then there exists a dyadic
cube $R_*$ with edge length $2^{-m_*}$ such that $R_* \subset A_{t_*/3} \setminus A_{t_*/k}$ for some $k > 0$,
and the claim follows from (A.5) as in Proposition A.1.

Assume now that $A_t \subset A \subset A_{3t}$ and $3t \le t_*$. Condition (a) of the proposition
implies that for any $\varepsilon > 0$ we can choose $k(\varepsilon) > 0$ large enough so that

$$\Pi(A \setminus A_{t/k}) \ge \Pi(A) - b_2 (t/k)^\gamma \ge \Pi(A) - \frac{b_2}{b_1} k^{-\gamma}\, \Pi(A_t) \ge (1 - \varepsilon)\, \Pi(A). \tag{A.8}$$
This means that for any partition of $A$ into dyadic cubes $R_i$ with edge length
$2^{-m}$ at least half of them satisfy

$$\Pi(R_i \setminus A_{t/k}) \ge (1 - c\varepsilon)\, \Pi(R_i). \tag{A.9}$$

Let $\mathcal{I}$ be the index set of cardinality $|\mathcal{I}| \ge c\, \Pi(A) 2^{dm-1}$ such that (A.9) is
true for $i \in \mathcal{I}$. Since $R_i \cap A_{t/k}$ is convex, there exists³ $z = z(\varepsilon) \in \mathbb{N}$ such
that for any such cube $R_i$ there exists a dyadic sub-cube with edge length
$2^{-(m+z)}$ entirely contained in $R_i \setminus A_{t/k}$:

$$T_i \subset R_i \setminus A_{t/k} \subset A_{3t} \setminus A_{t/k}.$$

It follows that $\Pi\big(\bigcup_i T_i\big) \ge c(\varepsilon)\, \Pi(A)$. Recall that condition (b) implies

$$\frac{\sup_{x \in A_{3t} \setminus A_{t/k}} \|\nabla\eta(x)\|_1}{\inf_{x \in A_{3t} \setminus A_{t/k}} \|\nabla\eta(x)\|_1} \le q(3k).$$

Finally, $\sup_{x \in A_{3t}} \|\nabla\eta(x)\|_2$ is attained at a boundary point, that is, for some
$x_*$ with $|\eta(x_*)| = 3t$, and by (b)

$$\sup_{x \in A_{3t}} \|\nabla\eta(x)\|_1 \le \sqrt{d}\, \|\nabla\eta(x_*)\|_1 \le q(3k) \sqrt{d} \inf_{x \in A_{3t} \setminus A_{t/k}} \|\nabla\eta(x)\|_1.$$

Application of (A. 5) to every cube Tj gives 



idX rp 



x) - fj m+z (x)) 2 dU(x) > Cl (k)U(A)\l\ inf ||V77(x)||?2- 2m 2- dm > 

x£A 3t \A t / k 



>c 2 (k)U(A) sup \\Vr,{x)\\i2-' m > c 3 (k)IL(A)\\r, - fj(m)\\^ A 
xeA 3t 

concluding the proof. □ 



Appendix B: Proof of Theorem 3.3 

The main ideas of this proof, which significantly simplifies and clarifies the author's
initial version, are due to V. Koltchinskii. For convenience and brevity, let
us introduce additional notation. Recall that

$$s_m = m(s + \log\log_2 N).$$

³ If, on the contrary, every sub-cube with edge length $2^{-(m+z)}$ contains a point from
$A_{t/k}$, then $A_{t/k}$ must contain the convex hull of these points, which would contradict (A.8)
for large $z$.
Let

TAr(m,s) := Ki — 

2 rfm + g + loglog 2 jV 
ir N (m, s) := K 2 — 

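By $\mathcal{E}_P(\mathcal{F}, f)$ (or $\mathcal{E}_{P_N}(\mathcal{F}, f)$) we denote the excess risk of $f \in \mathcal{F}$ with respect
to the true (or empirical) measure:

$$\mathcal{E}_P(\mathcal{F}, f) := P(y - f(x))^2 - \inf_{g \in \mathcal{F}} P(y - g(x))^2,$$

$$\mathcal{E}_{P_N}(\mathcal{F}, f) := P_N(y - f(x))^2 - \inf_{g \in \mathcal{F}} P_N(y - g(x))^2.$$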

It follows from Theorem 4.2 in Koltchinskii [10] and the union bound that
there exists an event $\mathcal{B}$ of probability $\ge 1 - e^{-s}$ such that on this event the
following holds for all $m$ such that $dm \le \log N$:

£p{Fm, frh) < VTAr(m, s) 

VfeTm, £p{F m ,f) < 2{£p N {T m J) VTT N (m,s)) (B.l) 

3 

V / e T m , £p N (J r m , f) < -(Spi^m, f) V ir N (m, s)). 

We will show that on B, {fh < fh} holds. Indeed, assume that, on the con- 
trary, fh > fh; by definition of fh, we have 

Pn{Y - frhf + T N {m, S) < P N (Y - frn) 2 + T N (fh, s), 

which implies 

£p N (J r m, fm) > r N (rh, s) - T N (fh, s) > 3ir N (m, s) 
for K\ big enough. By (B.l), 

3 

fe'fm' ±Ny ' """-2 \f&* 

and combining the two inequalities above yields



£p N {FrhJrh) = inf £p N {Tm,f) < o ( jn£ £p{Fm,f) VTT N (m,s) ) , 



inf £p(J r rh,f)>-n-N(m,s) (B.2) 

Since for any m £p(J- m , f) < K(f(X) — rj(X)) 2 , the definition of fh and 
(B.2) imply that 

Tr N (m,s)> inf E(/(X) -7?(X)) 2 >n N (m,s), 
contradicting our assumption, hence proving the claim. 